local detail
MR. Video: MapReduce as an Effective Principle for Long Video Understanding
The fundamental challenge of long video understanding, e.g., question answering, lies in the extensive number of frames, which makes it infeasible to densely understand the local details while comprehensively digesting the global contexts, especially within a limited context length. To address this problem, our insight is to process short video segments individually and combine these segment-level analyses into a final response. This intuition is reflected in the well-established MapReduce principle in big data processing and is naturally compatible with inference scaling at the system level. Motivated by this, we propose MR. Video (pronounced as mister video), a long video understanding framework adopting the MapReduce principle. We define the standard operations of MapReduce in a long video understanding context: the Map steps conduct independent and sequence-parallel dense perception on short video segments, covering local details, while the Reduce steps comprehensively aggregate the segment-level results into an answer with global contexts.
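The Map and Reduce steps described in this abstract can be illustrated with a minimal sketch. It is not the MR. Video implementation: the function names (`split_into_segments`, `map_segment`, `reduce_segments`, `map_reduce_video`) are hypothetical, frames are represented by plain string labels, and the Map step merely counts labels where a real system would presumably call a video-language model. It shows only the control flow: independent per-segment analysis in parallel, followed by a global aggregation.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def split_into_segments(frames: Sequence[str], segment_len: int) -> List[List[str]]:
    """Cut the frame sequence into consecutive short segments."""
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    return [list(frames[i:i + segment_len]) for i in range(0, len(frames), segment_len)]


def map_segment(segment: List[str]) -> Counter:
    """Map step (stand-in): analyse one segment independently of all others.

    Assumption: a real system would run dense perception with a model here;
    this toy version only counts the per-frame labels.
    """
    return Counter(segment)


def reduce_segments(segment_results: List[Counter]) -> str:
    """Reduce step (stand-in): merge segment-level results into one global answer."""
    total: Counter = Counter()
    for result in segment_results:
        total.update(result)
    if not total:
        return ""
    label, _count = total.most_common(1)[0]
    return label


def map_reduce_video(frames: Sequence[str], segment_len: int, max_workers: int = 4) -> str:
    """Run the Map step on every segment in parallel, then the Reduce step once."""
    segments = split_into_segments(frames, segment_len)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves segment order, so results stay aligned with the video.
        segment_results = list(pool.map(map_segment, segments))
    return reduce_segments(segment_results)


# Toy "video": one label per frame; the question is which label dominates overall.
frames = ["cat", "cat", "dog", "cat", "dog", "dog", "dog", "bird"]
answer = map_reduce_video(frames, segment_len=3)
print(answer)  # dog
```

Because each call to `map_segment` sees only its own segment, the Map work can be spread over as many workers as are available, while only the compact segment-level results need to fit into the single Reduce call.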
GLoD: Composing Global Contexts and Local Details in Image Generation
Diffusion models have demonstrated their capability to synthesize high-quality and diverse images from textual prompts. However, simultaneous control over both global contexts (e.g., object layouts and interactions) and local details (e.g., colors and emotions) still remains a significant challenge. The models often fail to understand complex descriptions involving multiple objects and reflect specified visual attributes to wrong targets or ignore them. This paper presents Global-Local Diffusion (GLoD), a novel framework which allows simultaneous control over the global contexts and the local details in text-to-image generation without requiring ...
MultiDiffusion [Bar-Tal et al., 2023] places an object with specified details on a certain region using segmentation masks and a prompt for each segment. These methods work without requiring any additional training; however, they struggle to control both the global contexts (e.g., object interactions) and the local details (e.g., object colors and emotions) simultaneously. With a complex prompt containing multiple objects, the models often misinterpret specified local details, directing them to the wrong target or ignoring them, similar to the issues observed in Stable Diffusion [Rombach et al., 2022]. While splitting the complex prompt into multiple prompts allows the model to depict each object more accurately, handling the prompts independently poses limitations in addressing a global context that describes interactions and relationships between the multiple objects.
Magic-Boost: Boost 3D Generation with Multi-View Conditioned Diffusion
Yang, Fan, Zhang, Jianfeng, Shi, Yichun, Chen, Bowen, Zhang, Chenxu, Zhang, Huichao, Yang, Xiaofeng, Feng, Jiashi, Lin, Guosheng
Benefiting from the rapid development of 2D diffusion models, 3D content creation has made significant progress recently. One promising solution involves fine-tuning pre-trained 2D diffusion models to harness their capacity for producing multi-view images, which are then lifted into accurate 3D models via methods like fast-NeRFs or large reconstruction models. However, because inconsistencies remain and the generated resolution is limited, the results of such methods still lack intricate textures and complex geometries. To solve this problem, we propose Magic-Boost, a multi-view conditioned diffusion model that significantly refines coarse generative results through a brief period of SDS optimization ($\sim$15 min). Compared to previous text- or single-image-based diffusion models, Magic-Boost exhibits a robust capability to generate images with high consistency from pseudo-synthesized multi-view images. It provides precise SDS guidance that aligns well with the identity of the input images, enriching the local detail in both geometry and texture of the initial generative results. Extensive experiments show Magic-Boost greatly enhances the coarse inputs and generates high-quality 3D assets with rich geometric and textural details. (Project Page: https://magic-research.github.io/magic-boost/)
Aesthetic Photo Collage with Deep Reinforcement Learning
Zhang, Mingrui, Li, Mading, Chen, Li, Yu, Jiahao
Photo collage aims to automatically arrange multiple photos on a given canvas with high aesthetic quality. Existing methods are based mainly on handcrafted feature optimization, which cannot adequately capture high-level human aesthetic senses. Deep learning offers a promising alternative, but owing to the complexity of collage and the lack of training data, a solution has yet to be found. In this paper, we propose a novel pipeline for automatically generating collages with a specified aspect ratio, introducing reinforcement learning to collage for the first time. Inspired by manual collages, we model collage generation as a sequential decision process that adjusts spatial positions, orientation angles, placement order and the global layout. To instruct the agent to improve both the overall layout and local details, the reward function is specially designed for collage, considering subjective and objective factors. To overcome the lack of training data, we pretrain our deep aesthetic network on a large-scale image aesthetic dataset (CPC) for general aesthetic feature extraction and propose an attention fusion module for structural collage feature representation. We test our model against competing methods on two movie datasets, and our results outperform the others in aesthetic quality evaluation. A further user study is also conducted to demonstrate its effectiveness.
Hierarchical Autoregressive Image Models with Auxiliary Decoders
De Fauw, Jeffrey, Dieleman, Sander, Simonyan, Karen
Autoregressive generative models of images tend to be biased towards capturing local structure, and as a result they often produce samples which are lacking in terms of large-scale coherence. To address this, we propose two methods to learn discrete representations of images which abstract away local detail. We show that autoregressive models conditioned on these representations can produce high-fidelity reconstructions of images, and that we can train autoregressive priors on these representations that produce samples with large-scale coherence. We can recursively apply the learning procedure, yielding a hierarchy of progressively more abstract image representations. We train hierarchical class-conditional autoregressive models on the ImageNet dataset and demonstrate that they are able to generate realistic images at resolutions of 128$\times$128 and 256$\times$256 pixels.